144 research outputs found
Hyperparameter optimization with approximate gradient
Most models in machine learning contain at least one hyperparameter to
control for model complexity. Choosing an appropriate set of hyperparameters is
both crucial in terms of model accuracy and computationally challenging. In
this work we propose an algorithm for the optimization of continuous
hyperparameters using inexact gradient information. An advantage of this method
is that hyperparameters can be updated before model parameters have fully
converged. We also give sufficient conditions for the global convergence of
this method, based on regularity conditions of the involved functions and
summability of errors. Finally, we validate the empirical performance of this
method on the estimation of regularization constants of L2-regularized logistic
regression and kernel Ridge regression. Empirical benchmarks indicate that our
approach is highly competitive with respect to state-of-the-art methods.
Comment: Proceedings of the International Conference on Machine Learning
(ICML
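As a rough illustration of updating a hyperparameter before the model parameters have fully converged, the sketch below computes an inexact hypergradient for ridge regression via the implicit function theorem. It is not the authors' implementation: the ridge setting, the parametrization of the regularization strength as exp(lam), the conjugate gradient solver, the fixed step size, and the names `cg`, `approx_hypergradient` and `tune_lambda` are all assumptions made for this example. The geometrically decreasing tolerance is one simple way to obtain a summable error sequence.

```python
import numpy as np

def cg(matvec, b, x0, tol, max_iter=1000):
    """Conjugate gradient for a symmetric positive definite system,
    stopped as soon as the residual norm drops below `tol`."""
    x = x0.copy()
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def approx_hypergradient(lam, w0, X_tr, y_tr, X_val, y_val, tol):
    """Inexact derivative of the validation loss with respect to lam.

    Inner problem: min_w 0.5*||X_tr w - y_tr||^2 + 0.5*exp(lam)*||w||^2
    Outer loss:    0.5*||X_val w - y_val||^2
    """
    reg = np.exp(lam)
    hess = lambda v: X_tr.T @ (X_tr @ v) + reg * v  # Hessian-vector product
    # 1. Solve the inner problem only up to tolerance `tol` (warm start at w0).
    w = cg(hess, X_tr.T @ y_tr, w0, tol)
    # 2. Gradient of the outer (validation) loss with respect to w.
    grad_val = X_val.T @ (X_val @ w - y_val)
    # 3. Solve the linear system H q = grad_val, again only up to `tol`.
    q = cg(hess, grad_val, np.zeros_like(w), tol)
    # 4. Implicit function theorem: dw/dlam = -exp(lam) * H^{-1} w.
    return -reg * (w @ q), w

def tune_lambda(X_tr, y_tr, X_val, y_val, lam=0.0, step=0.01,
                n_outer=30, tol=1e-1, decay=0.8):
    """Gradient descent on lam with a geometrically shrinking tolerance."""
    w = np.zeros(X_tr.shape[1])
    for _ in range(n_outer):
        grad, w = approx_hypergradient(lam, w, X_tr, y_tr, X_val, y_val, tol)
        lam -= step * grad
        tol *= decay  # tolerances form a summable sequence
    return lam
```

Early outer iterations use cheap, loosely converged inner solves, and the accuracy is tightened only as the hyperparameter settles. The fixed `step` is a placeholder; a practical method would choose it more carefully.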
On the Consistency of Ordinal Regression Methods
Many of the ordinal regression models that have been proposed in the
literature can be seen as methods that minimize a convex surrogate of the
zero-one, absolute, or squared loss functions. A key property that allows one to
study the statistical implications of such approximations is that of Fisher
consistency. Fisher consistency is a desirable property for surrogate loss
functions and implies that in the population setting, i.e., if the probability
distribution that generates the data were available, then optimization of the
surrogate would yield the best possible model. In this paper we will
characterize the Fisher consistency of a rich family of surrogate loss
functions used in the context of ordinal regression, including support vector
ordinal regression, ORBoosting and least absolute deviation. We will see that,
for a family of surrogate loss functions that subsumes support vector ordinal
regression and ORBoosting, consistency can be fully characterized by the
derivative of a real-valued function at zero, as happens for convex
margin-based surrogates in binary classification. We also derive excess risk
bounds for a surrogate of the absolute error that generalize existing risk
bounds for binary classification. Finally, our analysis suggests a novel
surrogate of the squared error loss. We compare this novel surrogate with
competing approaches on 9 different datasets. Our method proves to be highly
competitive in practice, outperforming the least squares loss on 7 out of 9
datasets.
Comment: Journal of Machine Learning Research 18 (2017
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Due to their simplicity and excellent performance, parallel asynchronous
variants of stochastic gradient descent have become popular methods to solve a
wide range of large-scale optimization problems on multi-core architectures.
Yet, despite their practical success, support for nonsmooth objectives is still
lacking, making them unsuitable for many problems of interest in machine
learning, such as the Lasso, group Lasso or empirical risk minimization with
convex constraints.
In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse
method inspired by SAGA, a variance reduced incremental gradient algorithm. The
proposed method is easy to implement and significantly outperforms the state of
the art on several nonsmooth, large-scale problems. We prove that our method
achieves a theoretical linear speedup with respect to the sequential version
under assumptions on the sparsity of gradients and block-separability of the
proximal term. Empirical benchmarks on a multi-core architecture illustrate
practical speedups of up to 12x on a 20-core machine.
Comment: Appears in Advances in Neural Information Processing Systems 30 (NIPS
2017), 28 pages
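To make the role of the proximal term concrete, here is a minimal sequential, dense proximal SAGA sketch for the Lasso. It is not ProxASAGA itself: the asynchronous and sparse aspects of the method are omitted, and the function names, the step size 1/(3L), and the least-squares data term are assumptions chosen for this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_saga_lasso(X, y, lam, n_epochs=50, seed=0):
    """Sequential proximal SAGA for
    (1/n) * sum_i 0.5 * (x_i @ w - y_i)^2 + lam * ||w||_1."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Step size 1/(3L), with L the largest per-sample Lipschitz constant.
    step = 1.0 / (3.0 * np.max(np.sum(X ** 2, axis=1)))
    w = np.zeros(d)
    # For linear models the stored gradient of sample i is resid_mem[i] * X[i],
    # so only one scalar per sample needs to be kept.
    resid_mem = np.zeros(n)
    grad_avg = np.zeros(d)  # average of the stored gradients
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        resid = X[i] @ w - y[i]
        delta = (resid - resid_mem[i]) * X[i]
        # Variance-reduced gradient estimate, followed by the proximal step.
        w = soft_threshold(w - step * (delta + grad_avg), step * lam)
        # Refresh the memory and its running average.
        grad_avg += delta / n
        resid_mem[i] = resid
    return w
```

Because the L1 penalty is separable across coordinates, its proximal operator acts coordinate by coordinate. The abstract's block-separability assumption on the proximal term refers to this kind of structure.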
HRF estimation improves sensitivity of fMRI encoding and decoding models
Extracting activation patterns from functional Magnetic Resonance Images
(fMRI) datasets remains challenging in rapid-event designs due to the inherent
delay of blood oxygen level-dependent (BOLD) signal. The general linear model
(GLM) allows one to estimate the activation from a design matrix and a fixed
hemodynamic response function (HRF). However, the HRF is known to vary
substantially between subjects and brain regions. In this paper, we propose a
model for jointly estimating the hemodynamic response function (HRF) and the
activation patterns via a low-rank representation of task effects. This model is
based on the linearity assumption behind the GLM and can be computed using
standard gradient-based solvers. We use the activation patterns computed by our
model as input data for encoding and decoding studies and report performance
improvement in both settings.
Comment: 3rd International Workshop on Pattern Recognition in NeuroImaging
(2013
Average-case Acceleration Through Spectral Density Estimation
We develop a framework for the average-case analysis of random quadratic
problems and derive algorithms that are optimal under this analysis. This
yields a new class of methods that achieve acceleration given a model of the
Hessian's eigenvalue distribution. We develop explicit algorithms for the
uniform, Marchenko-Pastur, and exponential distributions. These methods are
momentum-based algorithms, whose hyper-parameters can be estimated without
knowledge of the Hessian's smallest singular value, in contrast with classical
accelerated methods like Nesterov acceleration and Polyak momentum. Through
empirical benchmarks on quadratic and logistic regression problems, we identify
regimes in which the proposed methods improve over classical (worst-case)
accelerated methods.
Comment: Since the last version, we simplified the proof of Theorem 3.
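For context, the classical baseline mentioned above can be sketched as follows. This is Polyak (heavy-ball) momentum on a quadratic with its standard worst-case tuning, which needs both extreme eigenvalues of the Hessian. It is not one of the paper's average-case methods, and the function name and the choice to compute the eigenvalues exactly are assumptions made for illustration.

```python
import numpy as np

def polyak_heavy_ball(H, b, n_iter=200):
    """Minimize 0.5 * x @ H @ x - b @ x with Polyak momentum.

    The classical worst-case tuning requires the smallest (mu) and
    largest (L) eigenvalues of the positive definite Hessian H.
    """
    eigs = np.linalg.eigvalsh(H)  # eigenvalues in ascending order
    mu, L = eigs[0], eigs[-1]
    step = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    momentum = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x_prev = np.zeros_like(b)
    x = np.zeros_like(b)
    for _ in range(n_iter):
        grad = H @ x - b
        # Gradient step plus a momentum term built from the previous iterate.
        x, x_prev = x - step * grad + momentum * (x - x_prev), x
    return x
```

Both `step` and `momentum` depend on `mu`, which is often hard to estimate in practice. That dependence is what the abstract contrasts with methods tuned from a model of the eigenvalue distribution.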
Second order scattering descriptors predict fMRI activity due to visual textures
Second layer scattering descriptors are known to provide good classification
performance on natural quasi-stationary processes such as visual textures due
to their sensitivity to higher order moments and continuity with respect to
small deformations. In a functional Magnetic Resonance Imaging (fMRI)
experiment we present visual textures to subjects and evaluate the predictive
power of these descriptors with respect to the predictive power of simple
contour energy - the first scattering layer. We are able to conclude not only
that invariant second layer scattering coefficients better encode voxel
activity, but also that well predicted voxels need not necessarily lie in known
retinotopic regions.
Comment: 3rd International Workshop on Pattern Recognition in NeuroImaging
(2013
Easy over Hard: A Case Study on Deep Learning
While deep learning is an exciting new technique, the benefits of this method
need to be assessed with respect to its computational cost. This is
particularly important for deep learning since these learners need hours (to
weeks) to train the model. Such long training times limit the ability of (a) a
researcher to test the stability of their conclusion via repeated runs with
different random seeds; and (b) other researchers to repeat, improve, or even
refute that original work.
For example, recently, deep learning was used to find which questions in the
Stack Overflow programmer discussion forum can be linked together. That deep
learning system took 14 hours to execute. We show here that applying a very
simple optimizer called DE to fine-tune an SVM achieves similar (and
sometimes better) results. The DE approach terminated in 10 minutes, i.e., 84
times faster than the deep learning method.
We offer these results as a cautionary tale to the software analytics
community and suggest that not every new innovation should be applied without
critical analysis. If researchers deploy some new and expensive process, that
work should be baselined against some simpler and faster alternatives.
Comment: 12 pages, 6 figures, accepted at FSE201
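For readers unfamiliar with DE, a minimal differential evolution loop (the common DE/rand/1/bin variant) might look like the sketch below. It is not the paper's tuning code: the population size, mutation factor `f`, crossover rate `cr`, and the function name are assumed defaults, and in the paper's setting `objective` would be an SVM's validation error as a function of its hyperparameters rather than a toy function.

```python
import numpy as np

def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           n_gen=100, seed=0):
    """Minimize `objective` over a box using DE/rand/1/bin."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    # Random initial population inside the box.
    pop = lo + rng.random((pop_size, dim)) * (hi - lo)
    scores = np.array([objective(p) for p in pop])
    for _ in range(n_gen):
        for i in range(pop_size):
            # Pick three distinct members other than the current one.
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            # Mutation: perturb a with a scaled difference vector.
            mutant = np.clip(a + f * (b - c), lo, hi)
            # Binomial crossover; at least one coordinate comes from the mutant.
            cross = rng.random(dim) < cr
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # Greedy selection: keep the trial only if it is no worse.
            score = objective(trial)
            if score <= scores[i]:
                pop[i], scores[i] = trial, score
    best = np.argmin(scores)
    return pop[best], scores[best]
```

A call such as `differential_evolution(lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2, [(-3, 3), (-5, 1)])` searches a two-dimensional box. Each objective evaluation here is cheap, whereas evaluating an SVM configuration would require training and validating a model.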